Textual Inversion

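A technique for teaching Stable Diffusion, an image-generation AI, a new concept from just a few images. Unlike ordinary fine-tuning, it does not retrain the model: the model stays frozen, and only the embedding of a new "word" in the text encoder's embedding space is learned.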

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts.
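To make the "frozen model, learned word" idea concrete, here is a minimal toy sketch in pure Python. It is not the paper's training code: the 3x3 matrix FROZEN_W stands in for the frozen text-to-image model, TARGET stands in for what the 3-5 concept images look like to that model, and the vocabulary, the placeholder token "<my-concept>", and all numbers are hypothetical. The only thing that gets optimized is one new embedding vector.

```python
# Toy sketch of textual inversion: the model is frozen and only one new
# embedding vector is optimized. All names and numbers are hypothetical.

# Frozen "model": a fixed linear map standing in for the text-to-image model.
FROZEN_W = [
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 1.0],
]

# Frozen vocabulary embeddings; existing entries are never updated.
VOCAB = {"photo": [0.3, -0.2, 0.1], "toy": [0.2, 0.1, -0.3]}

# Stand-in for the features of the user's concept images.
TARGET = [1.0, -0.5, 2.0]


def model(v):
    """Apply the frozen linear map to an embedding vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in FROZEN_W]


def loss(v):
    """Squared error between the model output and the concept target."""
    return sum((o - t) ** 2 for o, t in zip(model(v), TARGET))


def learn_new_word(init_word="toy", steps=200, lr=0.1):
    """Gradient descent on the new embedding only; FROZEN_W is untouched."""
    v = list(VOCAB[init_word])  # start from a copy of an existing word
    dim = len(v)
    for _ in range(steps):
        residual = [o - t for o, t in zip(model(v), TARGET)]
        # Gradient of the squared error w.r.t. v is 2 * W^T * residual.
        grad = [
            2 * sum(FROZEN_W[i][j] * residual[i] for i in range(dim))
            for j in range(dim)
        ]
        v = [x - lr * g for x, g in zip(v, grad)]
    return v


w_snapshot = [row[:] for row in FROZEN_W]
initial_loss = loss(VOCAB["toy"])
new_vec = learn_new_word()
VOCAB["<my-concept>"] = new_vec  # the new "word" joins the vocabulary
final_loss = loss(new_vec)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.2e}")
```

After training, the new token can be looked up like any other word, which is why the result is a small embedding file rather than a modified copy of the model.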

CLIP L is used in SDXL and in SD3.5 as well, so unlike LoRA, maybe assets from Stable Diffusion 1.5 can be carried over, at least in principle? nomadoor.icon

That said, SDXL and later models combine multiple text encoders, so the effect is probably weaker.
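As a rough back-of-the-envelope illustration of that dilution, the snippet below uses the hidden sizes of SDXL's two text encoders (768 for CLIP L, 1280 for OpenCLIP bigG), whose per-token outputs are concatenated into one conditioning vector. Treating the share of channels as a proxy for influence is an assumption made here for illustration, not a measured result.

```python
# Hidden sizes of the two SDXL text encoders; their per-token outputs are
# concatenated along the channel axis to form the conditioning vector.
CLIP_L_DIM = 768
OPENCLIP_BIGG_DIM = 1280

context_dim = CLIP_L_DIM + OPENCLIP_BIGG_DIM

# Share of conditioning channels that an embedding living only in CLIP L
# could reach. Assumption: channel share is only a crude proxy for influence.
clip_l_share = CLIP_L_DIM / context_dim

print(f"context dim: {context_dim}, CLIP L share: {clip_l_share:.1%}")
```

Under this crude view, an embedding that exists only for CLIP L touches well under half of the conditioning channels in SDXL.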